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ABSTRACT 


'°Th«  procedures  for  analog  Monte  Carlo  simulation  of 
Markov  processes  are  examined.  Two  variance  reduction 
techniques  are  then  Included  In  a  nonanalog  formu¬ 
lation  to  Increase  the  sampling  efficiency  for  highly 
reliable  systems,  and  a  method  for  Incorporating 
uncertainty  In  failure  and  repair  rate  data  Is 
outlined.  Models  for  three  classes  of  component 
dependencies  appearing  In  reliability  and  availability 
problems  are  Incorporated  Into  the  Markov  formula¬ 
tion.  They  are  (1)  shared  repair  crews  between 
components,  (2)  load  sharing  between  components,  and 
(3)  standby  mode.  Results  are  given  for  a  aeries  of 
model  problems  to  demonstrate  the  efficiency  of  the 
methods  as  well  as  the  effects  of  the  dependencies  on 
system  unreliability  and  unavailability. 


r\ 


I  INTRODUCTION 

The  estimate  of  reliability  and/or  availability 
of  systems  with  large  numbers  of  components  In  highly 
redundant  configurations  frequently  Is  required  In  the 
analysis  of  a  variety  of  safety  and  protection  sys¬ 
tems.  The  analysis  most  often  proceeds  by  first 
constructing  a  fault  tree  [1,2]  to  determine  the 
possible  combinations  of  primary  component  failures 
that  will  result  In  system  failure.  Then  the  relia¬ 
bility  or  availability  of  the  system  Is  estimated 
quantitatively  In  terms  of  the  failure  and  repair 
rates  of  the  components. 

The  quantitative  evaluation  may  be  carried  out  by 
standard  deterministic  methods,  provided  the  probabi¬ 
lities  of  component  failure  and  repair  are  mutually 
Independent.  However,  the  task  becomes  prodigious  in 
the  presence  of  component  dependencies  such  as  appear 
with  backup  systems,  repair  crews  shared  between 
components,  or  comporaents  with  shared  loading.  In 
such  situations  the  system  may  be  modeled  aa  a  Markov 
process  [3-5J,  for  then  such  dependencies  can  be 
properly  represented,  provided  the  failure  and  repair 
rates  are  tlrae-lndflpendent .  It  then  follows,  however, 
that  a  system  of  2  coupled  first-order  differential 
equations  must  be  solved  numerically  If  a  system  with 
N  components  Is  to  be  modeled.  Since  this  represents 
over  a  million  differential  equations  for  a  system  of 
Permanent  Address:  Beijing  Nuclear  Engineering  Institute 
P.  0.  Box  840  Beijing 
People's  Republic  of  China 
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only  twenty  components,  numerical  Integration  of  other 
deterministic  approaches  rapidly  becomes  Intractable. 

Markov  Moot#  Carlo  sampling  [6,7]  provides  a 
means  of  clrcuaventlng  the  detailed  numerical  solution 
of  the  coupled  differential  equations  that  can  be  very 
effective,  provided  only  a  few  Integral  system  para¬ 
meters,  such  as  reliability  and  availability,  are  to 
be  estimated.  Moreover,  for  highly  reliable  systems 
for  which  analog  Monte  Carlo  sampling  la  very 
inefficient,  powerful  variance  reduction  schemes  [6] 
may  be  Incorporated  Into  Markov  Monte  Carlo  to 
Increase  computational  efficiency  by  orders  of  magni¬ 
tude.  Finally,  the  large  uncertainties  that  are 
Inherent  to  moat  failure  and  repair  rate  data  may  be 
taken  Into  account  systematically  such  that  the 
variance  of  Che  result  can  be  divided  Into  that 
associated  with  data  uncertainty  and  that  due  to  the 
finite  number  of  monte  Carlo  trials. 

In  what  follows  we  first  set  forth  the  basis  for 
analog  Monte  Carlo  simulation  of  Markov  processes. 

Two  variance  reduction  techniques,  forced  failure  and 
failure  biasing,  are  then  Introduced  no  Increase 
computational  efficiency.  The  treatment  of  data 
uncertainty  Is  examined,  and  the  paper  la  concluded 
with  numerical  results  for  a  series  of  example 
problems  that  differ  by  the  types  of  component 
dependencies  that  are  taken  Into  account. 

2.  ANALOG  MONTE  CARLO 

Each  of  the  2  Markov  states  for  an  N  component 
system  Is  defined  by  a  unique  combination  of  function¬ 
al  and  failed  components.  The  probability  density 
function  that  a  system  la  state  k*  at  time  t'  trill 
make  a  state  transition  at  time  t  la 

f(t|t’,k*>  -  expf— ,  (t  -  t’)I 

where  y^.  Is  the  transition  rate  out  of  state  k*. 

In  the  case  where  there  are  no  component  dependen¬ 
cies,  vk  Is  the  sum  of  the  failure  rates  of  the 
functional  components  and  the  repair  rates  of  the 
failed  components.  In  the  event  that  there  are 
component  dependencies,  certain  of  the  failure  or 
repair  rates  will  depend  on  the  states  of  other 
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component*.  To  (ample  the  time  Interval 
at  *  t  -  t'  between  state  transitions  we  obtain  the 
cumulative  probability  distribution  corresponding  to 
(1),  set  It  equal  to  a  random  number  C,  that  la 
uniformly  distributed  between  tero  and  one.  and  Invert 
to  obtain  (8] 

at  -  -  1 —  ln(l  -  C). 

V 

Having  determined  the  time  of  transition,  we  must 
determine  the  new  state  of  the  system.  The  transition 
probability  q<k|k')  from  state  k*  to  state  k  Is  Just 
1/y  ,  multiplied  by  the  failure  or  repair  rate  of  the 
component  that  must  change  states  In  order  for  the 
k*  ♦  k  transition  to  take  place.  To  sample  the 
transition,  a  uniformly  distributed  random  number 
£'  Is  generated  and  the  new  state  Is  found  by 
determining  the  value  of  k  that  satisfies  the 
Inequality 

k  k+1 

l  q(k-|k’)  <  V  <  l  q(k'|k' ). 

k"*0  k"»o 

The  most  general  version  of  the  Harkov  code 
estimates  either  unreliability  and  unavailability  from 
H  Independent  trials.  Each  trial  consists  of 
alternate  samplings  of  the  times  of  the  successive 
transitions  and  the  state  of  the  system  after  the 
transitions.  For  unavailability  calculations  the 
trial  ends  when  the  design  life  T  la  exceeded.  For 
unreliability  calculations  the  trial  ends  at  the  first 
system  failure  or  when  T  la  exceeded.  After  each 
transition,  the  system  status  Is  checked  to  determine 
whether  the  system  Is  In  an  operational  or  a  failed 
state.  Tallies  are  required  only  after  transitions 
for  which  there  Is  a  change  In  system  status  between  a 
failed  and  an  operational  state. 

For  purpose*  of  determining  its  status,  the 
system  Is  represented  by  an  equivalent  fault  tree. 

The  status  Is  then  determined  either  by  comparing  the 
combination  of  failed  components  to  the  tree's  minimal 
cut  sets,  which  have  previously  been  determined,  or  by 
direct  bottom-up  logical  evaluation  of  the  tree  |l). 

In  analog  Monte  Carlo  the  unreliability  tally  la 
binomial,  consisting  of  a  one  for  each  trial  resulting 
In  system  failure  before  t  ■  T,  and  zero  otherwise. 

For  the  unavailability,  the  tally  Is  Just  the  .'action 
of  the  time  T  for  which  the  system  Is  In  a  failed 
state.  Thus  If  H  trials  are  carried  out,  and  m  of 
them  are  found  to  result  In  system  failure,  the 
binomial  estimator  for  the  unreliability  is 

-  n/M, 

and  the  corresponding  sample  variance  Is 

s’<  V  ’  ik  V1  -  V- 

Likewise,  If  t  /T  Is  the  fraction  of  the  tlm^T  that 
the  system  Is  fn  a  failed  state  during  the  m  trial, 
the  unavailability  Is  estimated  by 


-  I  t  /T 
a-1 

with  *  sample  variance  cf 


$2(Ua>  -JM  "<f*-  V2* 

m-1 


Then,  according  to  the  central  limit  theorem_the  681 
confidence  Interval  for  the  results  is  UTS//M. 


VARIANCE  REDUCTION 

Even  with  the  Improved  sampling  that  results  from 
the  Lagranglan  approach  of  the  Harkov  simulation, 
analog  Honte  Carlo  analysis  of  highly  reliable  systems 
Is  likely  to  be  very  expensive,  since  only  very  rarely 
will  a  trial  contribute  a  nonzero  tally.  Fortunately, 
powerful  variance  reduction  techniques  are  easily 
adapted  to  the  Harkov  Monte  Carlo  formalism.  These 
techniques,  which  are  analogous  to  the  highly  refined 
importance  sampling  methods  employed  In  neutron  trans¬ 
port  calculations  |8-10),  greatly  Increase  the  compu¬ 
tational  efficiency  for  highly  reliable  systems, 
without  biasing  the  results. 

We  employ  two  such  techniques,  which  we  refer  to 
as  forced  transitions  and  failure  biasing.  In  these 
the  sampling  distributions  are  modified,  first  to 
produce  an  artificially  large  number  of  component 
transitions,  and  second  to  Increase  the  ratio  of 
failures  to  repairs.  To  each  trial  la  attached  a 
weight.  Initialized  to  one,  that  Is  modified  appro¬ 
priately  each  time  a  biased  sampling  distribution  Is 
used.  Then  by  defining  weighted  tallies,  unbiased 
estimators  may  be  shown  to  result. 

In  forced  transitions,  we  replace  f(t|t',k*)  by 


f(t|t’,k')  - 


-Y(t-t') 

ye 

-y(T-t') 


t'  <  t  <  T 


otherwise 


With  this,  the  uniformly  distributed  random  number 
E  can  be  used  to  sample  the  Interval  to  the  next 
transition: 


~  *n  [1  -  £|l-e 


-y(T-t’) 


),  0  «  At  <  T  -  t*. 


causing  the  next  tranoltlon  to  be  forced  before  the 
end  of  design  life.  To  compensate  for  the  modified 
sampling  the  trial  weight  w  If  modified  by 


w  *  ».|X  -  e~Y(T_t,)J. 


This  technique  Is  only  applied  when  y(T-t')  Is  small, 
and  therefore  there  Is  only  a  small  chance  of  an  ana¬ 
log  transition  before  the  end  of  design  life.  When 
there  are  large  transition  rates,  due  for  example  to 
the  presence  cf  large  repair  rates  In  y,  then  analog 
sampling  Is  used.  Conversely,  when  y(T-t')  Is  very 
small,  a  rare  event  approximation  may  be  used  to 
simplify  the  codified  probability  density  function  to 


f ( 1 1 1 '  ,k' )  - 


(T-t\>  ' 


t'<  t  <T. 


Therefore  At  can  be  sampled  from  a  uniform 
distribution,  and  the  weight  Is  modified  by 

w  *  y<T  -  t')w  . 

In  failure  blsslng  Che  transition  probabilities 
q(kjk’)  are  modified  to  increase  the  ratio  of  failures 
to  repairs.  This  Is  necessary  for  efficient  computs-  *■ 
tlon,  since  normally  the  repair  rates  are  an  order  of 
magnitude  or  more  larger  than  the  failure  rates.  The 
biased  transition  probabilities  are  taken  as 


q(k|k  }  l  q(k-|k’>  *' 
k"cA 


1  VA'VjTim 


q(k|k*)  “ 


q<h*|k*) 


Her*  A  le  used  to  indicate  the  eet  of  treneltlon*  out 
of  etete  k'  that  reeult  from  component  failure* ,  and  R 
Indicate*  thoae  that  reault  from  repali*.  Theae 
bleaed  ptobabllltle*  ere  defined  euch  that  a  fraction 
x  of  the  tranaltlon*  ere  forced  to  be  failure*. 
Typically,  take  a  to  be  at  leaet  one  half,  which  1* 
■och  larger  than  normally  would  be  the  caae  with 
unbiased  sampling.  To  maintain  unbiased  reault*  the 
trial  weight  la  modified  by 

w  ♦  w  l  *iCk“|k * ) 

S'cA 


lmpTOvem* its  of  more  than  three  orders  of  magnitude  or 
■tore  are  obtainable  ( 6 , 7 J . 

4.  DATA  DISTRIBUTIONS 

The  foregoing  reault*  are  for  polot  data  on  the 
failure  and  repair  rate*.  Data  uncertainty  can  also 
be  factored  into  the  calculetlona  In  a  rational  way. 
Each  failure  or  repair  rate  1*  represented  by  e  proba¬ 
bility  denelty  function  which  1*  eaopled  at  the  begin¬ 
ning  of  a  batch  of  M  historic*.  The  calculation  1* 
then  repeated  for  N  hatches  end  the  unreliability  la 
determined  from 

•*  •  »i, 


for  component  failure  and 

qU’lk’) 

Tt"tR 

for  repair. 

In  uelng  these  variance  reduction  techniques,  the 
trial  weight  la  appropriately  modified  at  each  biased 
sampling  until  the  flre^aystem  failure  occur*.  The 
weight  w  £  w^  for  the  a  history  1*  then  tabulated, 
the  variance  reduction  techniques  r.  e  turned  off  for 
the  remalndet  of  trial,  and  the  trial  la  completed 
uelng  analog  Harkov  Monte  Carlo.  This  combination  of 
biased  and  analog  sampling  la  found  to  be  Ideal,  for 
It  results  In  a  substantial  fraction  of  the  trial* 
contributing  nonzero  tallies  to  the  results.  At  the 
same  time  the  tally  procedure  for  the  unavailability 
la  Simplified,  and  one  does  not  encounter  the  problem 
of  very  long  trail*,  with  Insignificantly  small 
welghtE  resulting  from  many  system  failures. 

With  the  foregoing  Importance  sampling 
procedures,  the  estimate  for  the  unreliability  la  Just 

u  -  A  y  w  . 
r  M  S  m 
IB-1 

The  sampling  variance  la  then 


s2(V  VW«  '  V2* 

m-1 


The  estimate  lor  the  unavailability  is 


Likewise,  for  the  unavailability 
1  K 

U.  -  £  l  U.n*  | 

n»l 

th  I 

where  l^n  la  the  unavailability  estimate  for  the  n 
batch.  The  corresponding  sample  variances  are  given 

by  j 

s2<u )  ■  i  i  d  2  -  i)J 

r  H  *■,  m  r 
n-1 

and 

S2<V-£l  Ujn-U2 

n-1  j 

An  important  point  is  that  variance  of  either  result  i 
may  be  cast  In  the  form  1 8]  j 

SJ  •  J  A  +  B,  i 

I 

provided  that  M,  the  number  ol  trials  per  batch,  Is  ! 
large  enough  that  the  batch  averages  form  a  normal  j 

distribution.  The  first  term  Is  due  to  the  finite  | 

number  of  histories  per  batch,  and  the  second  is  due  j 
to  the  date  uncertainty*  Thus  the  batch  elzeF  need  be 
made  only  large  enough  that  the  second  term  domi~  j 

nates.  For  then  the  variance  In  the  result  la 
governed  by  the  data  uncertainty  and  not  the  finite 
batch  size.  In  reliability  and  availability  problems, 
where  the  data  uncertainties  are  often  substantial, 
quite  moderate  batch  sires  often  meet  this  criterion* 


u  -  1 

a  H  *  T 


where  t  /T  is  Just  the  fraction  of  the  m  trial  for 
which  tRe  eystem  Is  In  a  failed  state.  For  the 
unavailability,  the  sample  variance  Is 


■*<»*>•£ K  v2* 


The  positive  effects  of  the  variance  reduction 
techniques  on  computational  efficiency  are  most 
pronounced  for  highly  reliable  systems.  The  im 
provement  say  be  measured  in  terms  of  the  standard 
figure  of  merit  for  Monte  Carlo  calculations: 
l/S2?,  where  t”  la  the  time  per  history,  Is  a  standard 
measure  of  Monte  Carlo  computational  efficiency. 


5.  RESULTS 

In  illustrating  the  effects  of  component 
dependencies  on  system  unreliability  and  una¬ 
vailability  a  model  problem  is  defined  by  the  fault 
tree  shown  in  Fig.  1.  This  problem  has  been  studied 
In  the  absence  of  component  dependencies  using  both 
deterministic  [11)  and  Markov  Monte  Carlo  16]  merhods. 
The  component  failure  and  repair  rates  are  given  in 
Table  1.  System  failure  is  determined  by  bottoi-up 
logical  evaluation  of  Che  tree  [1]  after  each 
t  ransl tion. 

For  purposes  of  dependency  modeling  the  compo¬ 
nents  are  divided  Into  four  groups  ms  Indicated  in 
Table  1.  Using  f ixed‘ f ai lure  and  repair  rate  data,  we 
examine  each  of  the  dependency  types  Individually.  In 
all  cases  we  calculate  unreliability  using  a  design 
life  of  T  -  1,000  hours,  and  in  the  failure  biasing  * 
value  of  s  -  0,9  1*  used.  All  quoted  tines  are  or  a 
CYBER  20b  without  vec l ori rat  I  on , 


j 


In  Table  2  are  shown  the  effect*  of  shared  repair 
crews  on  the  unavailability  and  unreliability.  In  the 
base  case  there  are  no  dependencies  since  each  of  the 
components  has  a  repair  crew.  Note  that  the  great 
Increases  In  unreliability  and  unavailability  in  the 
problem  result  In  godng  from  two  to  one  repair  crew 
per  set,  while  practically  no  deterioration  results 
from  going  from  one  to  none. 

In  Table  3  results  are  shown  for  the  case  where 
within  each  of  the  four  component  groups  there  Is  load 
sharing,  such  that  the  failure  of  any  one  component 
within  the  group  increases  the  failure  rate  of  the 
remaining  components  by  AX, 

In  Table  4  results  are  given  for  standby 
configurations.  Component  groups  1  and  2  ere  (2/3) 
systems  with  the  third  component  held  In  standby, 
while  component  groups  3  and  4  art  (1/2)  standby 
systems.  The  effects  of  various  levels  of  the  standby 
failure  rate,  X*.  and  of  a  switching  failure 
probability  of  51  are  shown.  In  all  cases  p  ,  the 
repair  rate  for  switching  failures,  is  taken*to  be  0.5 
hr 

Finally,  In  realistic  calculations  the  data 
uncertainty  19  likely  to  be  large,  contributing  a 
variance  to  the  results  that  may  be  substantially 
larger  than  that  due  to  the  fact  that  only  a  finite 
number  of  Monte  Carlo  trials  have  been  run.  To 
demonstrate  this  effect  the  system  without  component 
dependencies  Is  run  with  data  uncertainties.  For  each 
component  the  failure  and  repetr  rates  are  represented 
by  lognormal  distributions.  Foi  all  components  the 
uncertainty  factor  In  both  failure  and  repair  rates  Is 
set  equal  to  three.  The  results  arc  compared  In  Table 
5.  In  the  calculations  without  uncertainty  10,000 
trials  were  used.  With  uncertainty,  1,000  batches  of 
25  trials  per  data  batch  were  employed.  With  this 
level  of  uncertainty  It  la  seen  that  roughly  three 
quarters  of  the  uncertainty  In  the  results  Is  due  to 
dats  uncertainty,  and  not  the  limited  number  of  Monte 
Carlo  trials.  The  results  are  altered  very  little  If 
increased  numbers  of  batches  are  used. 

Unless  otherwise  specified  the  foregoing 
calculations  each  consist  of  10,000  Monte  Carlo 
trials.  For  the  example  problems  studied  here  the 
presence  of  dependencies  never  increases  the  running 
time  by  as  much  as  a  factor  of  two.  Without  data 
uncertainty  running  time  Is  roughly  ten  thousandths  of 
s  second  per  trial  or  a  few  seconds  per  run.  With 
dsts  uncertainty,  times  of  less  than  one  hundredth  of 
a  second  per  batch  of  25  typically  are  required.  The 
code  Is  also  capable  of  treating  problems  In  which  all 
three  dependency  types  are  present,  a-;.  In  which  data 
uncertainty  Is  Included. 
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■Table  1.  Dats  for  example  problem 
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Table  2.  The  Influence  of  Che  Nuober  of  Repair  Crave 
on  Unreliability  and  Unavailability 


Nuaber  of 

repair  crewe  N 

Unreliability 

Unavailability 

Croup  1 

Croup  2 

0 

0 

.351?*lCZ  i*.l05 

.23S5*10‘24*.io‘5 

1 

0 

•3S17*10’2i4.io‘S 

.23*1*10‘24*.io"5 

0 

1 

.JS15*10“244.10'S 

.2166-10"2i*.io"5 

1 

1 

.3515*10~244.10“5 

•2263.10-24*.io"5 

2 

1 

.3*92*10“2i*.io“S 

•22t0.10“24*.l0~5 

1 

2 

.6323-10~4S5.io“6 

2 

2 

.4394.10_4i5.lo"7 

.H36-10-615*10~9 

3 

3 

.4394*10~4t5.10_7 

.1436.10~fc±S.lo"9 

Table  3.  The  Influence  of  Sharing  Load  or 

Unreliability  and  Unavailability 
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0.2 

.5278.1O"4i5.l0~7 

.1B0CM0'6;+7.10~9 

...  .  -4  -7 

0.5 

.6584*10  +6*10 

.2209.10  17-10 

0.8 

.TSIS.IO'Si.IO*7 

.2622.10'4+B.1o'9 

1.0 

.B778-10_4i7.io'7 

.2B99.10'6JB.1o'9 

2.0 

.I319.10"3ii.io”6 

. 4 260* 10”^+9. jo”9 

